Motion estimation apparatus and method for scanning a reference macroblock window in a search area

ABSTRACT

A motion estimation technique compares a current macroblock with different reference macroblocks in a reference frame search area. A motion vector for the current macroblock is derived from the reference macroblock most closely matching the current macroblock. To reduce the number of instructions required to load new reference macroblocks, overlapping portions between reference macroblocks are reused and only nonoverlapping portions are loaded into a memory storage device.

BACKGROUND

[0001] This application relies for priority upon Korean Patent Application No. 2001-40904, filed on Jul. 9, 2001, the contents of which are herein incorporated by reference in their entirety.

[0002] Video encoders generate bit streams that comply with international standards for video compression, such as H.261, H.263, MPEG-1, MPEG-2, MPEG-4, MPEG-7, and MPEG-21. These standards are widely applied in the fields of data storage, Internet based image service, entertainment, digital broadcasting, portable video terminals, etc.

[0003] Video compression standards use motion estimation where a current frame is divided into a plurality of macroblocks (MBs). Dissimilarities are computed between a current MB and other reference MBs existing in a search area of a reference frame. The reference MB in the search area most similar to the current MB is referred to as the “matching block” and is selected. A motion vector is encoded for the current MB that indicates a phase difference between the current MB and the matching block. The phase difference refers to the location difference between the current MB and the matching block. Since only the motion vector for the current MB is transmitted, a smaller amount of data has to be transmitted or stored.

[0004] The relationship between the current MB and a search area is shown in FIG. 1. According to a Quarter Common Intermediate Format (QCIF), one frame consists of 176×144 pixels, a current frame 2 consists of 99 current MBs, and each current MB 10 consists of 16×16 pixels. A motion vector is computed for the current MB 10 in the reference frame 4. A search area 12 in the reference frame 4 includes 48×48 pixels.
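
The macroblock count quoted above follows directly from the frame and macroblock dimensions. The short sketch below reproduces that arithmetic; the function name `macroblock_grid` is hypothetical and is introduced only for illustration.

```python
# Frame and macroblock sizes quoted in the text (QCIF frame, 16x16 macroblocks).
FRAME_W, FRAME_H = 176, 144
MB_SIZE = 16

def macroblock_grid(frame_w, frame_h, mb_size):
    """Return (columns, rows, total) of whole macroblocks that tile the frame."""
    cols = frame_w // mb_size
    rows = frame_h // mb_size
    return cols, rows, cols * rows

cols, rows, total = macroblock_grid(FRAME_W, FRAME_H, MB_SIZE)
print(cols, rows, total)  # 11 9 99
```

That is, 11 macroblock columns by 9 macroblock rows give the 99 current MBs per QCIF frame.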

[0005] In the search area 12, a 16×16 reference MB that is most similar to the current MB 10 is identified as the matching block. The differences between the current MB and the reference MBs can be computed by a variety of different methods, for example, by using the Mean of the Absolute Difference (MAD), the Mean of the Absolute Error (MAE), or the Sum of the Absolute Difference (SAD). The SAD is most popular because it only requires subtraction and accumulation operations.
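
As a minimal sketch of two of the measures named above, the following computes the SAD and the MAD of two equally sized blocks. Blocks are assumed to be lists of pixel rows; the function names are illustrative and do not come from the specification.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized pixel blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))

def mad(block_a, block_b):
    """Mean of absolute differences: the SAD divided by the pixel count."""
    pixel_count = sum(len(row) for row in block_a)
    return sad(block_a, block_b) / pixel_count

current = [[10, 20], [30, 40]]
reference = [[12, 18], [30, 45]]
print(sad(current, reference))  # 9
print(mad(current, reference))  # 2.25
```

The SAD needs only subtraction, absolute value, and accumulation, whereas the MAD adds a division by the number of pixels. For a fixed block size the two measures rank candidate blocks identically.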

[0006] FIG. 2 shows a basic full search in which pixels 10_1 and 14_1 are loaded into 32-bit registers 15 and 17, respectively. The SAD is then computed using an Arithmetic Logic Unit (ALU) 30. Both the current MB 10 and the reference MB 14a are stored in a memory and loaded into the 32-bit registers 15 and 17 pixel by pixel before being compared by the ALU 30. Reference MBs 14a, 14b, 14c, etc. existing in the search area 12 are compared with the current MB 10 on a pixel by pixel basis.

[0007] This simple, ideal estimation method provides high accuracy. However, the transmission rate is restricted because there are so many computations. This method is also unsuitable for real-time encoding with general purpose Central Processing Units (CPUs) that have limited processing capacity, such as some CPUs used in hand held Personal Computers (PCs).
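
To give a rough sense of the computational load, the sketch below counts the pixel comparisons of a full search. It assumes that every one-pixel offset at which the 16×16 block fits entirely inside the 48×48 search area is tested; the specification does not state the candidate count, so the figures are a back-of-the-envelope estimate and `full_search_cost` is a hypothetical helper.

```python
def full_search_cost(area, mb, mbs_per_frame):
    """Count candidates and pixel differences for an exhaustive search.

    area: search area edge length in pixels (assumed square)
    mb: macroblock edge length in pixels (assumed square)
    mbs_per_frame: number of current macroblocks in one frame
    """
    positions_per_axis = area - mb + 1          # offsets where the block fits
    candidates = positions_per_axis ** 2        # candidate reference MBs
    diffs_per_mb = candidates * mb * mb         # pixel differences per current MB
    return candidates, diffs_per_mb, diffs_per_mb * mbs_per_frame

print(full_search_cost(48, 16, 99))  # (1089, 278784, 27599616)
```

Under these assumptions, a single QCIF frame requires on the order of 27.6 million absolute-difference operations.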

[0008] A fast search algorithm (not shown) computes the SAD by comparing a current MB with only a limited number of the reference MBs in the search area. This fast search algorithm can dramatically reduce the number of computations compared to the full search method described above. However, the fast search algorithm results in reduced picture quality.

[0009] A quick computation of the SAD has been developed using a full search method. The SAD for a plurality of pixels is computed at the same time using a Single Instruction Multiple Data (SIMD) method. This reduced number of operations improves the transmission rate.

[0010] FIG. 3 illustrates the computation of the SAD using a SIMD device. Eight pixels 10_8 and 14_8 for the current MB 10 and reference MB 14a, respectively, are loaded into 64-bit registers 16 and 18, respectively. The SIMD machine 20 computes the SAD for the eight pixels loaded into each of the 64-bit registers 16 and 18 at the same time. Unlike a typical full search algorithm in which the SAD is separately computed for each pixel, a simultaneous parallel computation of the SAD for a plurality of pixels is achieved using the SIMD technique.
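
The packing of eight 8-bit pixels into one 64-bit register can be emulated in software as follows. This is only a sketch: Python integers stand in for the 64-bit registers, the lanes are processed in an ordinary loop rather than in parallel, and the names `pack8` and `sad8` and the little-endian lane order are assumptions made for illustration.

```python
def pack8(pixels):
    """Pack eight 8-bit pixels into one 64-bit word (lane 0 in the low byte)."""
    assert len(pixels) == 8 and all(0 <= p <= 255 for p in pixels)
    return int.from_bytes(bytes(pixels), "little")

def sad8(word_a, word_b):
    """SAD over the eight byte lanes of two 64-bit words."""
    lanes_a = word_a.to_bytes(8, "little")
    lanes_b = word_b.to_bytes(8, "little")
    return sum(abs(a - b) for a, b in zip(lanes_a, lanes_b))

current_reg = pack8([1, 2, 3, 4, 5, 6, 7, 8])
reference_reg = pack8([8, 7, 6, 5, 4, 3, 2, 1])
print(sad8(current_reg, reference_reg))  # 32

# Register loads needed for one 16x16 MB: per pixel versus eight pixels at a time.
print(16 * 16, 16 * 16 // 8)  # 256 32
```

A real SIMD unit would evaluate all eight lanes in a single instruction. The emulation only shows the data layout and the eightfold reduction in register loads per macroblock.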

[0011] The amount of computation varies depending on the direction the next MB is shifted in the search area 12. As shown in FIG. 3, whenever a next MB is selected by horizontal shifting, 8 pixels in both the current MB 10 and the reference MB 14 must be accessed from memory and loaded into the registers 16 and 18. This large number of memory accesses increases the amount of time required for deriving motion vectors and increases power consumption.

[0012] These conventional motion estimation methods are unsuitable in mobile environments because of the large number of memory accesses and the associated large power consumption. The present invention addresses this and other problems associated with the prior art.

SUMMARY OF THE INVENTION

[0013] A motion estimation technique compares a current macroblock with different reference macroblocks in a reference frame search area. A motion vector for the current macroblock is derived from the reference macroblock most closely matching the current macroblock. To reduce the number of instructions required to load new reference macroblocks, overlapping portions between reference macroblocks are reused and only nonoverlapping portions are loaded into a memory storage device.

[0014] The foregoing and other objects, features and advantages of the invention will become more readily apparent from the following detailed description of a preferred embodiment of the invention which proceeds with reference to the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

[0015] FIG. 1 is a prior art diagram showing how a motion vector is derived.

[0016] FIG. 2 is a prior art diagram illustrating a conventional method for performing a motion vector search using the Sum of the Absolute Difference (SAD) with a full search method.

[0017] FIG. 3 is a prior art diagram showing a conventional method for performing a motion vector search using a Single Instruction Multiple Data (SIMD) method.

[0018] FIG. 4 is a block diagram of a system for performing motion estimation according to the present invention.

[0019] FIG. 5 is a diagram of a decimation filter.

[0020] FIG. 6 is a diagram showing a current macroblock and a corresponding search area after decimation.

[0021] FIG. 7 is a diagram showing how two groups of registers are used according to the invention.

[0022] FIG. 8 shows how a reference macroblock is shifted in a search area according to the invention.

[0023] FIG. 9 is a flowchart showing how motion vectors are identified according to the invention.

[0024] FIGS. 10A-10D are charts comparing instruction counts for different motion estimation techniques.

[0025] FIGS. 11A-11D show other differences between conventional motion estimation methods and motion estimation according to the present invention.

[0026] FIG. 12 compares a vertical scanning technique according to the invention with other scanning techniques and shows the difference in memory access.

[0027] FIG. 13 shows conceptually a part of the dissimilarity computing unit 110 of FIG. 4.

DETAILED DESCRIPTION OF THE INVENTION

[0028] The present invention provides efficient motion estimation that reduces memory accesses by reusing common registers when scanning reference MBs in a search area.

[0029] FIG. 4 is a block diagram of the preferred embodiment of a motion estimation system according to the present invention. The motion estimation system includes a current frame (C/F) 100, a first register group 102, a dissimilarity computing unit 110, a search area (S/A) 104, a second register group 106, and a controller 108. The first and second register groups 102 and 106 store pixels for one macroblock (MB) of the current frame 100 and one macroblock of the search area 104, respectively. In one example, the size of one MB is 16×16 pixels. Each of the first and second register groups 102 and 106 can store an array of 16×16 pixels. The controller 108 may be implemented in software or hardware.

[0030] FIG. 5 shows a pre-processing step carried out using 4:1 decimation filters. An n:1 decimation filter is used on the current frame 100 (FIG. 4) to reduce required hardware resources. The current frame is represented by input frame 130 in FIG. 5. Frame 130 is divided into four decimation frames a, b, c and d by four 4:1 decimation filters 126a, 126b, 126c and 126d, and stored in a frame memory 128. A video signal output from a charge coupled image capture device (CCD) 120 is converted into digital signals through an Analog-to-Digital Converter (ADC) 122. The signal output from the ADC 122 is an RGB signal. A pre-processor 124 converts the RGB signal to a YCbCr signal. In one embodiment, only the Y signal is subjected to decimation by the decimation filter 126.

[0031] The decimation filter 126a is for pixels a in the input frame 130, the decimation filter 126b is for pixels b, the decimation filter 126c is for pixels c, and the decimation filter 126d is for pixels d. After the decimation, decimated frames a, b, c, and d are stored in the frame memory 128.

[0032] As a result of the 4:1 decimation for the input frame 130, the size of one MB reduces to 8×8 pixels. The search area 104 is decimated in the same ratio as the input frame 130. For example, 4:1 decimation for a search area of 48×48 pixels reduces the size of the search area to 24×24 pixels. FIG. 6 shows one current MB 140 and a corresponding search area 150 after 4:1 decimation.
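
The 4:1 decimation described above can be sketched as a split of the input frame into four sub-sampled frames. The sketch assumes that pixels a, b, c and d occupy the top-left, top-right, bottom-left and bottom-right positions of each 2×2 cell, and that decimation is plain sub-sampling with no low-pass filtering; neither detail is stated in the text, and `decimate_4to1` is a hypothetical name.

```python
def decimate_4to1(frame):
    """Split a frame (list of pixel rows) into four sub-sampled frames a, b, c, d.

    Assumed 2x2 cell layout:   a b
                               c d
    """
    a = [row[0::2] for row in frame[0::2]]  # even rows, even columns
    b = [row[1::2] for row in frame[0::2]]  # even rows, odd columns
    c = [row[0::2] for row in frame[1::2]]  # odd rows, even columns
    d = [row[1::2] for row in frame[1::2]]  # odd rows, odd columns
    return a, b, c, d

macroblock = [[r * 16 + c for c in range(16)] for r in range(16)]
search_area = [[0] * 48 for _ in range(48)]

for sub in decimate_4to1(macroblock):
    print(len(sub), len(sub[0]))   # 8 8 (printed four times)
print(len(decimate_4to1(search_area)[0]))  # 24
```

Each decimated frame keeps one pixel in four, so a 16×16 MB becomes 8×8 and a 48×48 search area becomes 24×24, matching the sizes given above.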

[0033] For convenience of explanation, the current frame is described as one of the four decimation frames a, b, c, and d passed through the 4:1 decimation filters of FIG. 5. Each MB in the current frame 100 has a size of 8×8 pixels, and the search area 104 after being passed through the 4:1 decimation filters has a size of 24×24 pixels.

[0034] The first register group 102 (FIG. 4) stores one current MB of the current frame 100, and the second register group 106 stores one reference MB of the search area 104. The first and second register groups 102 and 106 store the pixels in a predetermined order, shown by the circled numbers in FIG. 7. The computing order in each of the first and second register groups 140 and 160 is determined for groups of 8 pixels.

[0035] FIG. 7 shows the structures and loading sequences of the first and second register groups 102 and 106 in FIG. 4. The first register group 140 stores the current MB and includes registers each storing eight pixels. The registers are designated in a predetermined order from 0 to 7. The second register group 160 includes registers each storing eight pixels and designated in a predetermined order from 8 to 15. To calculate the difference between the current MB stored in the first register group 102 and the reference MB stored in the second register group 106, the SAD and motion vectors MV for a current reference block are calculated using the following equations.

$\mathrm{SAD}(dx, dy) = \sum_{m=x}^{x+N-1} \sum_{n=y}^{y+N-1} \left| I_{k}(m, n) - I_{k-1}(m + dx, n + dy) \right|$

$(MVx, MVy) = \min_{(dx, dy) \in R^{2}} \mathrm{SAD}(dx, dy)$

[0036] where $I_{k}(m, n)$ is the pixel value of the k-th frame at position (m, n). The motion vector (MVx, MVy) represents the displacement of the current block to the best match in the reference frame.
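
The SAD and motion vector definitions above can be sketched as an exhaustive search over candidate displacements, written here as (dx, dy). The sketch assumes that frames are lists of pixel rows indexed as frame[n][m] (row n, column m), that the search is limited to a square window of ±search_range pixels, and that the caller keeps all accesses inside the frame. The function names are illustrative only.

```python
def sad_at(cur, ref, x, y, dx, dy, n):
    """SAD between the n x n block of `cur` at (x, y) and `ref` displaced by (dx, dy)."""
    total = 0
    for row in range(y, y + n):
        for col in range(x, x + n):
            total += abs(cur[row][col] - ref[row + dy][col + dx])
    return total

def full_search(cur, ref, x, y, n, search_range):
    """Return ((MVx, MVy), minimum SAD) over all displacements in the window."""
    best_cost, best_mv = None, None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            cost = sad_at(cur, ref, x, y, dx, dy, n)
            if best_cost is None or cost < best_cost:
                best_cost, best_mv = cost, (dx, dy)
    return best_mv, best_cost

# Synthetic frames: the current frame shows the reference content moved by (2, 1).
ref = [[r * 12 + c for c in range(12)] for r in range(12)]
cur = [[(r + 1) * 12 + (c + 2) for c in range(12)] for r in range(12)]
print(full_search(cur, ref, x=4, y=4, n=4, search_range=3))  # ((2, 1), 0)
```

The motion vector is the displacement at which the SAD is smallest. In this synthetic example the match is exact, so the minimum SAD is zero.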

[0037] The dissimilarity computing unit 110 (FIG. 4) computes the differences of 8 pixels at the same time using the Single Instruction Multiple Data (SIMD) method in FIG. 3.

[0038] FIG. 13 shows conceptually the dissimilarity computing unit 110 of FIG. 4. An absolute difference value between each pixel of each register 142 of the first register group 102 and each pixel of each register 144 of the second register group 106 is stored in a register 132. For example, the absolute difference value between 142a and 144a is stored in 132a, and the absolute difference value between 142b and 144b is stored in 132b. To calculate the sum of the absolute differences between 142 and 144, one inner sum instruction is carried out, adding each difference value stored in the register 132, as shown in the dotted block of FIG. 13.

[0039] As shown in the dotted block of FIG. 13, one inner sum instruction is carried out using only multiple adders. In the conventional method, each value is added using an add instruction and a shift instruction, so additional cycles are required compared with the present method. Thus, eight inner sum instructions are carried out to calculate the complete SAD between the decimated current MB and the decimated reference MB.
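
The inner sum described above can be sketched as a reduction over the eight absolute differences held in one register. The sketch assumes that the adders are arranged as a pairwise tree and that the lane count is a power of two; the text says only that multiple adders are used, so the tree structure and all function names are illustrative assumptions.

```python
def abs_diff_lanes(reg_a, reg_b):
    """Lane-wise absolute differences of two registers (lists of pixels)."""
    return [abs(a - b) for a, b in zip(reg_a, reg_b)]

def inner_sum(lanes):
    """Add all lanes with a tree of pairwise adders; return (total, adder levels).

    Assumes the number of lanes is a power of two (eight in the text).
    """
    levels = 0
    while len(lanes) > 1:
        lanes = [lanes[i] + lanes[i + 1] for i in range(0, len(lanes), 2)]
        levels += 1
    return lanes[0], levels

def mb_sad(cur_rows, ref_rows):
    """SAD of an MB stored one row per register; return (SAD, inner sums issued)."""
    total = 0
    inner_sums = 0
    for cur_reg, ref_reg in zip(cur_rows, ref_rows):
        row_total, _ = inner_sum(abs_diff_lanes(cur_reg, ref_reg))
        total += row_total
        inner_sums += 1
    return total, inner_sums

print(inner_sum(abs_diff_lanes([1, 2, 3, 4, 5, 6, 7, 8],
                               [8, 7, 6, 5, 4, 3, 2, 1])))  # (32, 3)
print(mb_sad([[1] * 8] * 8, [[0] * 8] * 8))                 # (64, 8)
```

With eight lanes the assumed tree is three adder levels deep, and an 8×8 decimated MB needs one inner sum per row, which gives the eight inner sum instructions mentioned above.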

[0040] Once the SADs for all the pixels of the current MB 10 and the reference MB 14 are computed, an internal sum for the reference MB 14a is calculated by adding up the SADs for each pixel. After the internal sums for all the reference MBs of the search area 12 are calculated, the reference MB having the least internal sum is identified as the matching block, and the result of the computation is output as a difference of MB (E_MB) in FIG. 4. The controller 108 in FIG. 4 controls how the reference MB window is shifted in the search area 104 using the SIMD scanning method to reduce the number of memory accesses.

[0041] FIG. 12 shows in more detail some differences between conventional scanning methods and the scanning method according to the invention. For a full search, according to the conventional scanning method, a next reference block is shifted from a current reference block by one pixel in a horizontal or vertical direction, as shown in FIGS. 12_1 and 12_2, respectively. In these cases, most pixels in the currently compared reference block overlap with the pixels used in a next compared reference block.

[0042] For the horizontal scanning shown in FIG. 12_1, only the far right region of the next register group 106′_2 includes new pixels compared with the pixels in register group 106′_1. Likewise, for the vertical scanning shown in FIG. 12_2, only the lower region of the next register group 106″_2 includes new pixels compared with the current register group 106″_1. Even though only the edge regions include new pixels, memory accesses are performed for the entire reference macroblock 106.

[0043] A vertical scanning for SIMD scheme according to the present invention is shown in FIG. 12_3. Only new pixels 106′″_2 are loaded from main memory into the second register group 106 in FIG. 4. As shown in FIG. 7, the second register group 160b reuses the overlapping pixels stored in register regions 9 through 15 of the second register group 160a. Only the first register region 8 of the second register group 160a is loaded with a new row of pixel values. The first register region 8 is moved down to the last position in the second register group 160b. The other register regions 9-15 that store rows of pixels that overlap with a next reference block are moved up in the sequence by one. For example, register region 9 is moved to a first position, register region 10 is moved to a second position, register region 11 is moved to a third position, etc.

[0044] This shifting of the reference MB requires only one memory access to read a new nonoverlapping row of pixels for each vertical shift in the search area 104 (FIG. 4). Since the entire 8×8 pixel array for the next reference MB does not have to be read from memory, the number of memory accesses for scanning the search area 104 is reduced.
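
The register reuse described above can be sketched with a rotating queue of rows, one row per register. The class below is a hypothetical model: `ReferenceWindow`, its attributes, and the use of `collections.deque` are assumptions made for illustration, and each row fetch is counted as one memory access.

```python
from collections import deque

class ReferenceWindow:
    """Holds one reference MB as a queue of pixel rows (one row per register)."""

    def __init__(self, search_area, column, size=8):
        self.area = search_area
        self.column = column
        self.size = size
        self.top = 0          # row index of the window's first row
        self.row_loads = 0    # number of rows fetched from the search area
        self.rows = deque(self._load(r) for r in range(size))

    def _load(self, row):
        """Fetch one row of `size` pixels from memory and count the access."""
        self.row_loads += 1
        return self.area[row][self.column:self.column + self.size]

    def shift_down(self):
        """Move the window down one row, loading only the new bottom row."""
        new_row = self._load(self.top + self.size)
        self.rows[0] = new_row   # overwrite the register holding the stale top row
        self.rows.rotate(-1)     # that register becomes last; the others move up
        self.top += 1

area = [[r * 24 + c for c in range(24)] for r in range(24)]
window = ReferenceWindow(area, column=0)
for _ in range(16):
    window.shift_down()
print(window.row_loads)  # 24
```

Scanning one full column of 17 window positions therefore takes 8 initial row loads plus 16 single-row loads. Reloading all eight rows at every position would take 17 × 8 = 136 row loads.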

[0045] FIG. 8 shows the shifting of the reference MB in the search area 104. The reference MB window is vertically scanned under the control of the controller 108 in FIG. 4. The reference MB window is vertically shifted by one row of pixels at a time. While this shows vertical window shifting, the same technique can be used for horizontal window shifting. Horizontal shifting could be used when pixels are stored in sequential locations in memory along vertical columns of the current and reference frames.

[0046] As described above, when registers capable of storing data for one MB are used and a reference MB window is vertically shifted in a search area, overlapping pixels between a current reference MB and a next reference MB are reused. This reduces the number of memory accesses required by the controller 108 to scan the search area. The current MB is stored in the first register group, and the current reference MB is stored in the second register group.

[0047] FIG. 9 is a flowchart showing in more detail the SIMD scanning scheme according to the present invention. A current frame and a reference frame are decimated in a ratio of n:1 in step 170. For convenience of explanation, n=4 in the present embodiment. A parameter HS indicates the position of the last column of the first reference MB in the search area, a parameter VS indicates the position of the last row of the first reference MB in the search area, and a parameter DCM identifies which of the four decimation frames is being processed.

[0048] Here, the first reference MB is the left uppermost MB in the search area, and the first parameter HS and the second parameter VS for the first reference MB are zero. In step 172, the parameters HS, VS, and DCM are all initialized to zero, and a minimum dissimilarity E_MIN is initialized with a value as large as possible, for example, infinity.

[0049] Identification Nos. 0, 1, 2, and 3 are assigned to the four decimation frames, respectively. The parameter DCM is compared to the value 4 in step 174 to determine whether motion estimation is completed for the last decimation frame. If motion estimation is not completed for the last decimation frame, a current MB is loaded into the first register group 140 (see FIG. 7) in step 176.

[0050] It is determined in step 178 whether the HS parameter is less than 17. When the HS parameter is not less than 17, the motion estimation is completed for the last column (HS16) in the search area. HS is reset to zero in step 192 and DCM is incremented to the next DCM frame in step 198. The process then returns to step 174.

[0051] If motion estimation is not completed up to HS16, it is determined whether the VS parameter is less than 17 in step 180. If VS is less than 17, a pipelining procedure is performed in steps 182 and 184. Only the last row VS1 is loaded into the reference MB in step 182 (see FIG. 8). If the motion estimation is not completed up to the last row, i.e., if a reference MB window is not shifted to the last row VS16, the reference MB is loaded into the second register group 160a in step 182. The difference between the current MB and the reference MB is calculated in step 184.

[0052] In this case, the new row VS1 in the vertical direction is stored in the first register position in the sequence of register regions. For example, $register8 of the second register group 160a is loaded with the next new nonoverlapping row of pixels for the next reference MB. The other register regions, i.e., $register9 through $register15, are moved up in the sequence by one. That is, the second register group 160b in FIG. 7 reuses the pixels stored in the register regions $register9 through $register15. Thus, only the pixels of the new row VS1 (FIG. 8) are accessed from memory and stored in the register region $register8 of the second register group 160a.

[0053] In step 184, the difference between the MBs loaded into the first and second register groups 140 and 160 in FIG. 7 is computed. The MB dissimilarity E_MB is compared with the minimum dissimilarity E_MIN in step 186. If the MB dissimilarity E_MB is less than the minimum dissimilarity E_MIN, the minimum dissimilarity E_MIN is set to the MB dissimilarity E_MB in step 188. If the MB dissimilarity E_MB is not less than the minimum dissimilarity E_MIN, the current minimum dissimilarity E_MIN is maintained. The parameter VS is then incremented in step 190. Steps 180 through 190 are repeated until vertical scanning of the reference MB reaches the last row VS16 (FIG. 8).

[0054] If it is determined in step 180 that the second parameter VS is not less than 17 as a result of scanning the last row VS16, the parameter VS is initialized to zero in step 200. The parameter HS is incremented in step 202, and the process returns to step 178. In other words, the reference MB window is shifted one pixel position to the right. Steps 180-190 are then repeated.

[0055] After the reference MB window is shifted in a horizontal direction to the last column HS16, i.e., if it is determined in step 178 that the parameter HS is not less than 17, the first parameter HS is reinitialized to zero in step 192. The DCM parameter is incremented in step 198 and the process returns to step 174. Incrementing the DCM parameter means that motion estimation for another decimation frame is performed.

[0056] When motion estimation is completed for all the decimation frames, i.e., if it is determined in step 174 that the DCM parameter is not less than 4, the reference MB with the least dissimilarity is identified as the matching block in step 204. Motion estimation for the current frame is completed by repeating the processes described above for all the MBs of the current frame.
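
The flowchart steps described above can be sketched as three nested loops over DCM, HS and VS. The sketch makes several assumptions that the text does not spell out: each decimation frame is supplied as a pair of an 8×8 current MB and a 24×24 search area, the window is fully loaded (eight rows) at the top of each column, and the position of the best match is recorded alongside E_MIN so that the matching block can be identified. The function names and the row-load counter are illustrative.

```python
from collections import deque

MB = 8                      # decimated macroblock size (8x8)
AREA = 24                   # decimated search area size (24x24)
POSITIONS = AREA - MB + 1   # 17 window positions per axis (0..16)

def sad(block_a, block_b):
    """Sum of absolute differences between two blocks given as rows of pixels."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))

def scan(decimated):
    """Scan every decimation frame; return (E_MIN, (dcm, hs, vs), row loads)."""
    e_min = float("inf")                                   # step 172
    best = None
    row_loads = 0
    for dcm, (cur_mb, area) in enumerate(decimated):       # steps 174, 176, 198
        for hs in range(POSITIONS):                        # steps 178, 202
            # Assumed: a full load of eight rows at the top of each column.
            window = deque(area[r][hs:hs + MB] for r in range(MB))
            row_loads += MB
            for vs in range(POSITIONS):                    # steps 180, 190
                if vs > 0:                                 # step 182: one new row
                    window[0] = area[vs + MB - 1][hs:hs + MB]
                    window.rotate(-1)
                    row_loads += 1
                e_mb = sad(cur_mb, window)                 # step 184
                if e_mb < e_min:                           # steps 186, 188
                    e_min, best = e_mb, (dcm, hs, vs)
    return e_min, best, row_loads

area = [[r * AREA + c for c in range(AREA)] for r in range(AREA)]
cur_mb = [area[r][3:11] for r in range(5, 13)]   # block taken at hs=3, vs=5
print(scan([(cur_mb, area)]))  # (0, (0, 3, 5), 408)
```

In this sketch one decimation frame needs 17 × (8 + 16) = 408 row loads. Reloading all eight rows at each of the 17 × 17 positions would need 2312 row loads. These counts illustrate the scanning order only and are not the measured instruction counts reported for FIGS. 10a through 10d.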

[0057] As described above, the first and second register groups store a current MB and a reference MB. The reference MB window is vertically shifted in a search area for motion estimation. Overlapping pixels between a current reference MB and a next reference MB are reused. As a result, fewer instructions (Load/Store) are required when loading the next reference MB into the second register group. This allows faster motion estimation with less power consumption.

[0058] FIGS. 10a through 10d show the advantages of the present invention over conventional motion estimation methods. FIG. 10a identifies the instruction count for a conventional motion estimation method in which decimation is not performed, i.e., a full search algorithm. It was determined that 26.2% of the total instruction count for the conventional method of FIG. 10a is required for memory access instructions and the remaining 73.8% of the instruction count is for non-memory accessing. FIG. 10a corresponds to FIG. 2 where a reference MB is horizontally shifted in a search area and motion estimation is carried out using SAD for each pixel. FIG. 10b shows the total instruction count for a conventional motion estimation method where decimation is performed. FIG. 10c shows the total instruction count for conventional motion estimation in which decimation and SIMD are used.

[0059] FIG. 10d shows the total instruction count for the motion estimation using the present invention. For the three cases shown in FIGS. 10b through 10d, the percentages 27.0%, 1.6%, and 0.9%, respectively, are relative ratios of the memory access instruction counts compared with the conventional motion estimation method of FIG. 10a. It is apparent that the orthogonal scanning method to access the non-overlapped portion is the most efficient technique for reducing the memory access count.

[0060] FIG. 11 shows the total number of clock cycles required to extract 99 minimum SADs for 2 frames having the Quarter Common Intermediate Format (QCIF). In FIG. 11, 11a corresponds to FIG. 10a, 11b corresponds to FIG. 10b, 11c corresponds to FIG. 10c, and 11d corresponds to FIG. 10d. The performance of the orthogonal scanning scheme to access the non-overlapped portion is improved by a factor of two over the conventional motion estimation method using normal SIMD.

[0061] The scanning technique described above can be implemented with a Single Instruction Multiple Data (SIMD) device or a Very Long Instruction Word (VLIW) device for comparing the current macroblock with the reference macroblock. The scheme used for matching macroblocks can include a Mean of the Absolute Difference (MAD), Mean of the Absolute Error (MAE), or Sum of the Absolute Difference (SAD) scheme. The method for selecting the next reference macroblock can include a fast algorithm or full search algorithm. Of course, other single instruction/multi-data devices, matching schemes, and searching algorithms can also be used.

[0062] The invention may be embodied in a general purpose digital computer by running a program from a computer usable medium, including but not limited to storage media such as magnetic storage media (e.g., ROMs, floppy disks, hard disks, etc.), optically readable media (e.g., CD-ROMs, DVDs, etc.) and carrier waves (e.g., transmissions over the Internet). The computer usable medium can be stored and executed in distributed computer systems connected by a network.

[0063] The system described above can use dedicated processor systems, micro controllers, programmable logic devices, or microprocessors that perform some or all of the operations. Some of the operations described above may be implemented in software and other operations may be implemented in hardware.

[0064] For the sake of convenience, the operations are described as various interconnected functional blocks or distinct software modules. This is not necessary, however, and there may be cases where these functional blocks or modules are equivalently aggregated into a single logic device, program or operation with unclear boundaries. In any event, the functional blocks and software modules or features of the flexible interface can be implemented by themselves, or in combination with other operations in either hardware or software.

[0065] Having described and illustrated the principles of the invention in a preferred embodiment thereof, it should be apparent that the invention may be modified in arrangement and detail without departing from such principles. Claimed are all modifications and variations coming within the spirit and scope of the following claims.

1. An image processing apparatus, comprising: a first storage element adapted to store a current macroblock; a second storage element adapted to store a first reference macroblock; a computing unit to compute a difference between contents of the first storage element and the second storage element; and a controller adapted to load a second reference macroblock into the second storage element by replacing a nonoverlapping portion of the first reference macroblock with a nonoverlapping portion of the second reference macroblock.
 2. An image processing apparatus of claim 1 wherein results of the computing unit are used for determining a motion vector.
 3. An image processing apparatus of claim 1 wherein the computing unit includes a Single Instruction Multiple Data (SIMD) device.
 4. An image processing apparatus according to claim 1 wherein portions of the first reference macroblock that are overlapping with portions of the second reference macroblock are reused in the second storage element by the computing unit to compute the difference between the first storage element and the second storage element.
 5. An image processing apparatus according to claim 1 wherein the first storage element comprises multiple registers each storing a group of pixel values for the current macroblock and the second storage element comprises multiple registers storing a group of pixel values for the first reference macroblock.
 6. An image processing apparatus according to claim 5 wherein the computing unit compares the group of pixel values stored in each register of the first storage element with the group of pixel values stored in each register of the second storage element at the same time.
 7. An image processing apparatus according to claim 5 wherein each one of the multiple registers in the first storage element stores a row or a column of the current macroblock and each one of the multiple registers in the second storage element stores a row or a column of the first reference macroblock.
 8. An image processing apparatus according to claim 1 wherein the nonoverlapping portion of the second reference macroblock is loaded from a memory into the second storage element.
 9. An image processing apparatus according to claim 1 wherein the controller loads the second reference macroblock into the second storage element by moving a first register position storing nonoverlapping portion to a last register position in the second storage element and moving up in order other registers in the second storage element storing overlapping portions of the first reference macroblock.
 10. An image processing apparatus according to claim 1 including a preprocessor that decimates a current frame into multiple decimated current frames and decimates a reference frame into multiple decimated reference frames.
 11. An image processing apparatus according to claim 1 wherein the controller and the computing unit are implemented in either software or hardware.
 12. An image processing apparatus according to claim 5 wherein the computing unit includes: a third storage element adapted to store absolute differences between each pixel of each register of the first storage element and each pixel of each register of the second storage element; and a summation circuit for deriving a summation for the absolute difference values stored in the third storage element.
 13. An image processing apparatus according to claim 12 wherein the summation circuit comprises only multiple adders.
 14. An image processing apparatus according to claim 12 wherein a single inner sum instruction causes the summation circuit to generate the summation for all of the absolute difference values stored in the third storage element.
 15. A motion estimation method, comprising: loading a current macroblock; loading a current reference macroblock; comparing the current macroblock with the current reference macroblock; and loading a next reference macroblock by replacing a nonoverlapping portion of the loaded current reference macroblock with a nonoverlapping portion of the next reference macroblock.
 16. A method according to claim 15 including reusing an overlapping portion of the current reference macroblock for comparing the next reference macroblock with the current macroblock.
 17. A method according to claim 15 including: loading in one instruction a nonoverlapping group of pixels from the next reference macroblock into an identified register that currently contains a nonoverlapping portion of pixels for the current reference macroblock; and reusing pixels in other registers that overlap with the next reference macroblock.
 18. A method according to claim 17 including loading the identified register from a memory storing a reference frame.
 19. A method according to claim 17 including moving an order of the identified register storing the nonoverlapping portion of the next reference macroblock to a last register position and moving up the order of the other registers.
 20. A method according to claim 15 including comparing each group of pixel values for the loaded current macroblock with each group of pixel values for the loaded current reference macroblock at the same time.
 21. A method according to claim 20 wherein the group of pixel values each comprise a row or column of the current macroblock or a row or column of the current reference macroblock.
 22. A method according to claim 15 including using a Single Instruction Multiple Data (SIMD) device or a Very Long Instruction Word (VLIW) device for comparing the current macroblock with the current reference macroblock.
 23. A method according to claim 15 including comparing the current macroblock with the current reference macroblock using a matching macroblock scheme.
 24. A method according to claim 23 wherein the matching macroblock scheme is Mean of the Absolute Difference (MAD), Mean of the Absolute Error (MAE), or the Sum of the Absolute Difference (SAD).
 25. A method according to claim 15 including selecting the next reference macroblock using a fast algorithm or full search algorithm.
 26. A method according to claim 15 including: decimating a current frame into multiple decimated current frames; decimating a reference frame into multiple decimated reference frames; selecting the current macroblock from the decimated current frames; shifting the selected current macroblock over search areas of the decimated reference frames to identify a reference macroblock most similar to the current macroblock; and deriving a motion vector for the identified reference macroblock.
 27. A method according to claim 20 including: storing absolute differences between each group of pixel values for the loaded current macroblock with each group of pixel values for the loaded current reference macroblock; and deriving a summation of the absolute difference values.
 28. A method according to claim 27 including using only adders to derive the summation for the absolute difference values.
 29. A method according to claim 28 including using a single inner sum instruction to generate the summation for all of the absolute difference values. 